Pipeline package is available on CRI Bioinformatics FTP
The Center for Research Informatics (CRI) provides computational resources and expertise in biomedical informatics for researchers in the Biological Sciences Division (BSD) of the University of Chicago.
As a bioinformatics core, we are actively improving our pipelines and expanding pipeline functions. The tutorials will be updated in a timely manner but may not reflect the newest updates of the pipelines. Stay tuned with us for the latest pipeline release.
If you have any questions, comments, or suggestions, feel free to contact our core at bioinformatics@bsd.uchicago.edu or one of our bioinformaticians.
RNA sequencing (RNA-seq) is a revolutionary approach that uses the next-generation sequencing technologies to detect and quantify expressed transcripts in biological samples. Compared to other methods such as microarrays, RNA-seq provides a more unbiased assessment of the full range of transcripts and their isoforms with a greater dynamic range in expression quantification.
In this tutorial, you will learn how to use the CRI’s RNA-seq pipeline (available on both CRI HPC cluster and GitHub)) to analyze Illumina RNA sequencing data. The tutorial comprises the following Steps:
By the end of this tutorial, you will:
This tutorial is based on CRI’s high-performance computing (HPC) cluster. If you are not familiar with this newly assembled cluster, a concise user’s guide can be found here.
The RNA-seq data used in this tutorial are from CRI-BIO-646-BMB-RKeenan.
In this tutorial, we use the sequencing reads in the project CRI-BIO-646 in mouse as example. The sample information are saved in the file CRI-BIO-646.metadata.txt (see below).
Work Flow
There are six (partial) single-end RNA-seq sequencing libraries will be used as the example dataset In this tutorial. Their respective sample information is described in the metadata table CRI-BIO-646/CRI-BIO-646.metadata.txt.
| Sample | Library | ReadGroup | LibType | Platform | SequencingCenter | Date | Lane | Unit | Flavor | Encoding | Run | Genome | NucleicAcid | Group | Location | Seqfile1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WT01 | WT01 | BK-AA-1_S11_L007_R355 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 355 | hg38 | rnaseq | WT | CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ | BK-AA-1_S11_L007_R1_001.fastq.gz |
| WT02 | WT02 | BK-AA-2_S12_L007_R355 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 355 | hg38 | rnaseq | WT | CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ | BK-AA-2_S12_L007_R1_001.fastq.gz |
| WT03 | WT03 | BK-AA-3_S13_L007_R355 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 355 | hg38 | rnaseq | WT | CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ | BK-AA-3_S13_L007_R1_001.fastq.gz |
| KO01 | KO01 | BK-AA-4_S14_L007_R355 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 355 | hg38 | rnaseq | KO | CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ | BK-AA-4_S14_L007_R1_001.fastq.gz |
| KO02 | KO02 | BK-AA-5_S15_L007_R355 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 355 | hg38 | rnaseq | KO | CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ | BK-AA-5_S15_L007_R1_001.fastq.gz |
| KO03 | KO03 | BK-AA-6_S16_L007_R355 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 355 | hg38 | rnaseq | KO | CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ | BK-AA-6_S16_L007_R1_001.fastq.gz |
| WT01 | WT01 | BK-AA-1_S11_L007_R357 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 357 | hg38 | rnaseq | WT | CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ | BK-AA-1_S11_L007_R1_001.fastq.gz |
| WT02 | WT02 | BK-AA-2_S12_L007_R357 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 357 | hg38 | rnaseq | WT | CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ | BK-AA-2_S12_L007_R1_001.fastq.gz |
| WT03 | WT03 | BK-AA-3_S13_L007_R357 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 357 | hg38 | rnaseq | WT | CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ | BK-AA-3_S13_L007_R1_001.fastq.gz |
| KO01 | KO01 | BK-AA-4_S14_L007_R357 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 357 | hg38 | rnaseq | KO | CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ | BK-AA-4_S14_L007_R1_001.fastq.gz |
| KO02 | KO02 | BK-AA-5_S15_L007_R357 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 357 | hg38 | rnaseq | KO | CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ | BK-AA-5_S15_L007_R1_001.fastq.gz |
| KO03 | KO03 | BK-AA-6_S16_L007_R357 | RF | Illumina | GCF | 2018-11-07 | 7 | K00242 | 1x50 | 33 | 357 | hg38 | rnaseq | KO | CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ | BK-AA-6_S16_L007_R1_001.fastq.gz |
We will use SSH (Secure Shell) to connect to CRI’s HPC. SSH now is included or can be installed in all standard operating systems (Windows, Linux, and OS X).
The login procedure varies slightly depending on whether you use a Mac/Unix/Linux computer or a Windows computer.
Connect to the login node of the CRI HPC cluster:
$ ssh -l username@gardner.cri.uchicago.eduType in the password when prompted
Make sure that you replace
usernamewith your login name.
- CAUTION
- THIS PACKAGE IS LARGE, PLEASE DO NOT DOWNLOAD IT TO YOUR HOME DIRECTORY
- USE OTHER LOCATION LIKE /gpfs/data/bioinformatics/username
You should be in your home directory after logging in
$ pwd
/home/usernameInstead of downloading the pipeline package to your local home directory, use other location like /gpfs/data/bioinformatics/username
$ cd /gpfs/data/bioinformatics/username; pwd
/gpfs/data/bioinformatics/usernameOne way to download the pipeline package via wget
$ wget ftp://logia.cri.uchicago.edu/bioinformatics/tutorials/Nov2018/CRI-BIO-646-BMB-RKeenan.tgzOr, copy the package directly from HPC
$ cp /gpfs/data/bioinformatics/shared/tutorials/Nov2018/CRI-BIO-646-BMB-RKeenan.tgz .Uncompress the tarball file
$ tar -zxvf CRI-BIO-646-BMB-RKeenan.tgzChange working directory to pipeline dirctory
$ cd CRI-BIO-646-BMB-RKeenan
$ tree -d -L 4
## ../
## ├── CRI-BIO-646
## │ ├── data
## │ │ ├── 180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10
## │ │ │ └── FastQ
## │ │ └── 180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10
## │ │ └── FastQ
## │ └── references
## │ └── v28_92_GRCh38.p12
## │ └── STAR
## ├── DATA
## ├── SRC
## │ ├── Python
## │ │ ├── lib
## │ │ ├── module
## │ │ └── util
## │ └── R
## │ ├── lib
## │ │ └── 3.5
## │ ├── module
## │ └── util
## └── docs
## ├── IMG
## └── result
##
## 23 directoriesRaw sequencing data files (*.fastq.gz) are located at CRI-BIO-646/data/
$ tree CRI-BIO-646/data/
|-- 180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10
| `-- FastQ
| |-- BK-AA-1_S11_L007_R1_001.fastq.gz
| |-- BK-AA-2_S12_L007_R1_001.fastq.gz
| |-- BK-AA-3_S13_L007_R1_001.fastq.gz
| |-- BK-AA-4_S14_L007_R1_001.fastq.gz
| |-- BK-AA-5_S15_L007_R1_001.fastq.gz
| `-- BK-AA-6_S16_L007_R1_001.fastq.gz
`-- 180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10
`-- FastQ
|-- BK-AA-1_S11_L007_R1_001.fastq.gz
|-- BK-AA-2_S12_L007_R1_001.fastq.gz
|-- BK-AA-3_S13_L007_R1_001.fastq.gz
|-- BK-AA-4_S14_L007_R1_001.fastq.gz
|-- BK-AA-5_S15_L007_R1_001.fastq.gz
`-- BK-AA-6_S16_L007_R1_001.fastq.gzGenome data are located at /gpfs/data/bioinformatics/ReferenceData/cri_rnaseq_2018/vM18_93_GRCm38.p6
$ tree /gpfs/data/bioinformatics/cri_rnaseq_2018_ex/example/references/v28_92_GRCh38.p12
|-- GRCh38_rRNA.bed
|-- GRCh38_rRNA.bed.interval_list
|-- STAR
| |-- Genome
| |-- Log.out
| |-- SA
| |-- SAindex
| |-- chrLength.txt
| |-- chrName.txt
| |-- chrNameLength.txt
| |-- chrStart.txt
| |-- exonGeTrInfo.tab
| |-- exonInfo.tab
| |-- geneInfo.tab
| |-- genomeParameters.txt
| |-- run_genome_generate.logs
| |-- sjdbInfo.txt
| |-- sjdbList.fromGTF.out.tab
| |-- sjdbList.out.tab
| `-- transcriptInfo.tab
|-- genes.gtf
|-- genes.gtf.bed12
|-- genes.refFlat.txt
|-- genome.chrom.sizes
|-- genome.dict
`-- genome.faproject related files (i.e., metadata & configuration file) as used in this tutorial are located under CRI-BIO-646/
$ ls -l CRI-BIO-646/CRI-BIO-646.*
## CRI-BIO-646/CRI-BIO-646.metadata.txt
## CRI-BIO-646/CRI-BIO-646.pipeline.yaml
Here are the first few lines in the configuration example file CRI-BIO-646/CRI-BIO-646.pipeline.yaml
---
pipeline:
flags:
aligners:
run_star: True
quantifiers:
run_featurecounts: True
run_rsem: False
run_kallisto: False
callers:
run_edger: True
run_deseq2: True
run_limma: True
software:
main:
use_module: 0
adapter_pe: AGATCGGAAGAGCGGTTCAG,AGATCGGAAGAGCGTCGTGT
adapter_se: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
fastq_format: 33
genome_assembly: hg38When running on another dataset, you will need to modify these
two filesand the master pipeline script (i.e.,Build_RNAseq.CRI-BIO-646.sh) (as described below) accordingly.For instance, if you would like to turn off the DE analysis tool limma, you can set the respecitve paramter to ‘False’ in configuration file
- run_limma: False
For metadata file, you might pay attendtion on the following settings
- Single End (SE) Library
- Set Flavor column as 1xReadLength (e.g., 1x50)
- Set Seqfile1 column as the file name of the repective sequencing file
- Paired End (PE) Library
- Set Flavor column as 2xReadLength (e.g., 2x50)
- Set Seqfile1 column as the file name of the repective read 1 (R1) sequencing file
- Set an additional column named ‘Seqfile2’ as the file name of the repective read 2 (R2) sequencing file
- Non strand-specific Library
- Set LibType column to NS
- Strand-specific Library
- Inquire the library type from your seuqencing center and set LibType column to FR (the left-most end of the fragment (in transcript coordinates, or the first-strand synthesis) is the first sequenced) or RF (the right-most end of the fragment (in transcript coordinates) is the first sequenced, or the second-strand synthesis). You can read this blog for more details of strand-specific RNA-seq.
Master pipeline script
$ cat Build_RNAseq.CRI-BIO-646.sh
##
##
## ## build pipeline scripts
##
## now=$(date +"%m-%d-%Y_%H:%M:%S")
##
## ## project info
## project="CRI-BIO-646"
## SubmitRNAseqExe="Submit_${PWD##*/}.sh"
## padding="CRI-BIO-646/"
##
## ## command
## echo "START" `date` " Running build_rnaseq.py"
## python3 SRC/Python/build_rnaseq.py \
## --projdir $PWD \
## --metadata $PWD/${padding}$project.metadata.txt \
## --config $PWD/${padding}$project.pipeline.yaml \
## --systype cluster \
## --threads 8 \
## --log_file $PWD/Build_RNAseq.$project.$now.log
##
## ## submit pipeline master script
## echo "START" `date` " Running $SubmitRNAseqExe"
## echo "bash $SubmitRNAseqExe"
##
## echo "END" `date`
Basically, when running on your own dataset, you will need to modify this master pipeline script (i.e.,
Build_RNAseq.CRI-BIO-646.sh) accordingly.For instance, you can change respective parameters as follows.
- project=“PROJECT_AS_PREFIX” (e.g., CRI-BIO-646 which is used as a prefix of metadata file CRI-BIO-646.metadata.txt and configuration file CRI-BIO-646.pipeline.yaml)
- padding=“DIRECTORY_NAME_CONTAINING_PROJECT_DATA” (e.g., CRI-BIO-646 which is the folder name to accommodate metadata file, configuration file, sequencing data folder, and references folder)
Before running, you need to know
The master BigDataScript script can be
ONLY run on a head/entry nodeother than any other compuatation node. But, you can step by step run individual sub-task bash scripts in any computation node interactively.
Generate sub-task scirpts of the pipepline
# load modules
$ module purge;module load gcc udunits R/3.5.0 python/3.6.0; module update
# This step is optional but it will install all necessary R packages ahead.
# In case the pipeline was terminated due to the failure of R package installation later when running the pipeline.
# A successful execution result will show the package versions of all necessary packages from devtools to pheatmap
$ Rscript --vanilla SRC/R/util/prerequisite.packages.R
# create directories and generate all necessary scripts
$ bash Build_RNAseq.CRI-BIO-646.sh
this step will execute SRC/Python/build_rnaseq.py using python3 to generate all sub-task bash scripts and directories according to the provided metadata and configuration files (i.e., CRI-BIO-646/CRI-BIO-646.metadata.txt and CRI-BIO-646/CRI-BIO-646.pipeline.yaml )
$ tree -d RNAseq
|-- Aln
| `-- star
| |-- KO01
| | |-- BK-AA-4_S14_L007_R355
| | `-- BK-AA-4_S14_L007_R357
| |-- KO02
| | |-- BK-AA-5_S15_L007_R355
| | `-- BK-AA-5_S15_L007_R357
| |-- KO03
| | |-- BK-AA-6_S16_L007_R355
| | `-- BK-AA-6_S16_L007_R357
| |-- WT01
| | |-- BK-AA-1_S11_L007_R355
| | `-- BK-AA-1_S11_L007_R357
| |-- WT02
| | |-- BK-AA-2_S12_L007_R355
| | `-- BK-AA-2_S12_L007_R357
| `-- WT03
| |-- BK-AA-3_S13_L007_R355
| `-- BK-AA-3_S13_L007_R357
|-- AlnQC
| |-- picard
| | `-- star
| | |-- KO01
| | | `-- tmp
| | |-- KO02
| | | `-- tmp
| | |-- KO03
| | | `-- tmp
| | |-- WT01
| | | `-- tmp
| | |-- WT02
| | | `-- tmp
| | `-- WT03
| | `-- tmp
| `-- rseqc
| `-- star
| |-- KO01
| | `-- tmp
| |-- KO02
| | `-- tmp
| |-- KO03
| | `-- tmp
| |-- WT01
| | `-- tmp
| |-- WT02
| | `-- tmp
| `-- WT03
| `-- tmp
|-- DEG
| |-- deseq2
| | `-- featurecounts
| | `-- star
| |-- edger
| | `-- featurecounts
| | `-- star
| `-- limma
| `-- featurecounts
| `-- star
|-- LociStat
| `-- featurecounts
| `-- star
| `-- CRI-BIO-646-BMB-RKeenan
|-- PostAna
| |-- clusterprofiler
| | `-- featurecounts
| | `-- star
| | `-- CRI-BIO-646-BMB-RKeenan
| `-- pheatmap
| `-- featurecounts
| `-- star
| `-- CRI-BIO-646-BMB-RKeenan
|-- QuantQC
| `-- featurecounts
| `-- star
|-- Quantification
| `-- featurecounts
| `-- star
| |-- KO01
| |-- KO02
| |-- KO03
| |-- WT01
| |-- WT02
| `-- WT03
|-- RawReadQC
| |-- KO01
| | |-- BK-AA-4_S14_L007_R355
| | | `-- BK-AA-4_S14_L007_R1_001_fastqc
| | | |-- Icons
| | | `-- Images
| | `-- BK-AA-4_S14_L007_R357
| | `-- BK-AA-4_S14_L007_R1_001_fastqc
| | |-- Icons
| | `-- Images
| |-- KO02
| | |-- BK-AA-5_S15_L007_R355
| | | `-- BK-AA-5_S15_L007_R1_001_fastqc
| | | |-- Icons
| | | `-- Images
| | `-- BK-AA-5_S15_L007_R357
| | `-- BK-AA-5_S15_L007_R1_001_fastqc
| | |-- Icons
| | `-- Images
| |-- KO03
| | |-- BK-AA-6_S16_L007_R355
| | | `-- BK-AA-6_S16_L007_R1_001_fastqc
| | | |-- Icons
| | | `-- Images
| | `-- BK-AA-6_S16_L007_R357
| | `-- BK-AA-6_S16_L007_R1_001_fastqc
| | |-- Icons
| | `-- Images
| |-- WT01
| | |-- BK-AA-1_S11_L007_R355
| | | `-- BK-AA-1_S11_L007_R1_001_fastqc
| | | |-- Icons
| | | `-- Images
| | `-- BK-AA-1_S11_L007_R357
| | `-- BK-AA-1_S11_L007_R1_001_fastqc
| | |-- Icons
| | `-- Images
| |-- WT02
| | |-- BK-AA-2_S12_L007_R355
| | | `-- BK-AA-2_S12_L007_R1_001_fastqc
| | | |-- Icons
| | | `-- Images
| | `-- BK-AA-2_S12_L007_R357
| | `-- BK-AA-2_S12_L007_R1_001_fastqc
| | |-- Icons
| | `-- Images
| `-- WT03
| |-- BK-AA-3_S13_L007_R355
| | `-- BK-AA-3_S13_L007_R1_001_fastqc
| | |-- Icons
| | `-- Images
| `-- BK-AA-3_S13_L007_R357
| `-- BK-AA-3_S13_L007_R1_001_fastqc
| |-- Icons
| `-- Images
`-- shell_scriptsExecute the entire analysis with just one command
# This will start to run the entire pipeline.
# You can chekc teh BDS report to know the running status.
$ bash Submit_CRI-BIO-646-BMB-RKeenan.sh
Submit_CRI-BIO-646-BMB-RKeenan.bds, so you will need to run this command on a head/entry node.Check running status
- $ qstat -a
- tail *.bds.log
Considering the environment setting in the CRI HPC system, BigDataScript was used as a job management system in the current development to achieve an automatic pipeline. It can handle the execution dependency of all sub-task bash scripts and resume from a failed point, if any.
After the completion of the entire pipeline, you will see a BigDataScript report in HTML under the pipeline folder. For instance, this is the report from one test run. The graphic timeline will tell you the execution time per sub-task script.
[Last Updated on 2018/11/07] | Top